Detect HTLC-resolving on-chain actions and pass them to ChannelManager #269
Conversation
Aims to detect on-chain resolution of channel HTLCs:
- Modify test_txn_broadcast in consequence, so channel_monitor_network_test still passes
- Modify some tests due to block re-scans caused by the detection extensions
Real thanks for catching the case with outbound HTLCs removed from the local commitment tx due to multiple commitment_signed without an RAA sent by the remote peer. This kind of edge case makes me wonder if there are other cases like this lurking in ChannelMonitor branches, so obviously the whole onchain-to-offchain stuff, once merged, needs more review/hardening.
Generally agree with the whole thing, except for being skeptical of 06c46ec, but as you noted it has to be refactored later.
let mut existing_claim = false;
e.get_mut().retain(|htlc_data| {
	if htlc.0 == htlc_data.0 {
		if htlc_data.1.is_some() {
Ah yes, thanks for correcting. I think I told myself it was unlikely that an output would first be resolved by a claim and then by a timeout, but in fact it's completely doable if the HTLC times out and the remote tries to claim it afterwards while we broadcast our HTLC-timeout tx. For the TODO, which delay seems reasonable to you? BOLT 5 suggests 100 blocks to consider an output irrevocably resolved; if so, maybe we should add a timestamp height in pending_htlc_updated?
Well, we definitely can't wait for 100 blocks because we have to fail backwards before then. For now I think 6 is probably fine. I added a constant for it in a new commit, with appropriate checking.
Good. Did you push the commit to your branch? The head commit is currently fb29935.
@@ -2622,6 +2622,20 @@ impl ChannelManager {

impl events::MessageSendEventsProvider for ChannelManager {
	fn get_and_clear_pending_msg_events(&self) -> Vec<events::MessageSendEvent> {
		// TODO: Event release to users and serialization is currently race-y: its very easy for a
Hmmm, not a big fan of this. It puts more responsibility on the client to call both event cleaners at the right time, and could screw up some backward claims in case of contention upstream if the cltv delta is short or the client is stuck in other processing. Seems to me that being block-driven guarantees we do our best to fulfill backward HTLCs, but agreed this needs more thought with the #80 refactor.
Well, any of the actual results are going to need to wait on an event clear anyway; e.g. if it results in sending an update_fulfill_htlc we still have to wait for them to call get_and_clear_pending_msg_events. It may even make sense to move the transaction broadcasting stuff into events, as the whole broadcaster API is pretty awkward.
As for the race condition here: either we can serialize the pending events, or we can just fail serialization if there are pending events and let the user loop until there are none left, which I think is probably ok.
Ah yes, so I think it's fine, acknowledging it's best-effort for now. I need to spend a bit more time with rust-lightning-bitcoinrpc to get more clues on how the broadcast/events APIs could be used.
src/ln/channelmanager.rs
Outdated
}
if deliver_bs_raa {
	match events[0] {
		MessageSendEvent::UpdateHTLCs { ref node_id, updates: _ } => {
nit: maybe specify it's an update_add_htlcs there?
checked.
src/ln/channelmanager.rs
Outdated
if !deliver_bs_raa {
	// If we delivered B's RAA we got an unknown preimage error, not something
	// that we should update our rouring table for.
nit: heyy, what is a rouring table used for?
heh, fixed.
if let Some(preimage) = htlc_update.payment_preimage {
	self.claim_funds_internal(self.channel_state.lock().unwrap(), htlc_update.source, preimage);
} else {
	self.fail_htlc_backwards_internal(self.channel_state.lock().unwrap(), htlc_update.source, &htlc_update.payment_hash, HTLCFailReason::Reason { failure_code: 0x4000 | 10, data: Vec::new() });
I believe it's mine, but in fact the "HTLC Output Handling: Local Commitment, Local Offers" rationale in BOLT 5 seems to indicate that failure_code should be permanent_channel_failure instead of unknown_next_peer here?
Ah, hmm, can we fix this in a separate PR? It needs to be fixed in a few other places too, I think.
Ok, will open a separate PR then.
Force-pushed from fb29935 to fdf4128 (compare)
Pushed a big chunk rewrite, as we need to consider HTLCs not included in the commitment transaction due to the dust limit as well. This has significant implications for the design layout here, so I went ahead and fixed those, though I did not actually implement all the bits that will be needed for correctness. I added TODOs in a few places noting this, however.
Insert it in current_local_signed_tx, prev_local_signed_tx, remote_claimable_outpoints, so it gets provided by Channel calls to provide_latest_{local,remote}_tx.
Called in ChannelMonitor block_connected, returning HTLCUpdate upstream via ManyChannelMonitor to link HTLCs between monitors. Used by ChannelManager to fulfill/fail HTLCs backwards accordingly. If spurious HTLCUpdates are generated due to a block re-scan and the HTLCs are already LocalRemoved, discard them in Channel get_update_*_htlc.
Pass failure backward
Force-pushed from fdf4128 to 461501e (compare)
Also further loosened the debug_assert!()s in the Channel get_update_*_htlc functions, since they were still too strict and the fuzzer was finding ways to hit them.
Okay, seems with these Channel refactors we should have enough HTLCSource data to fail backward trimmed HTLCs, but honestly we shouldn't leave behind too many new TODOs (461501e), to avoid further unseen pitfalls and complexities in the onchain-to-offchain stuff.
// Return the new channel monitor in a last-ditch effort to hit the
// chain and claim the funds
log_warn!(self, "Have preimage and want to fulfill HTLC with pending failure against channel {}", log_bytes!(self.channel_id()));
// TODO: We may actually be able to switch to a fulfill here, though its
Hmm, so we broadcast our local commitment tx plus the HTLC-timeout tx, ChannelMonitor detects it's a failure, and at the same moment the remote sends us an update_fulfill_htlc? Agreed it shouldn't happen often, but I tried to clean up the TODO at ariard@f70314b.
Well, after we do the delay-fail-backwards stuff until we get a few confirmations, this will only be able to happen if the remote side hasn't sent us an RAA in several blocks and we get a several-block reorg, so I think it's probably rare enough to not be worth it. But if you want to add a test for that, feel free :p.
Hmm, not sure that the remote sending us its RAA or not changes anything; them forwarding us an update_fulfill_htlc later than the safe delay-fail-backwards just means we lost the opportunity to claim the HTLC backward. Agreed it should only happen with a reorg longer than the delay-fail-backward, so rare enough, but maybe fun to write a test on it; will wait to write the delay-fail-backward stuff first :)
This case is only the holding cell, so we can only get here if we went to fail backwards initially and couldn't as we were waiting on an RAA. If we weren't waiting on an RAA we would have failed backwards immediately and have no recourse if a reorg happens.
Ah yes! I was quickly thinking we would only fail backward after reorg-delay exhaustion, but that would slow down the network so much it would be really dumb; doesn't make sense.
Oops, I think we're talking past each other a bit, too many different channels and things involved...no, I think we should delay failing backwards until reorg delay exhaustion...once things have hit the chain we're gonna be mega slow compared to regular operation anyway, being slower still isn't much of a hit.
The case here is where an upstream channel hit the chain and, after it claimed an HTLC with a timeout, we waited a few blocks to make sure there wasn't gonna be a reorg, then failed it backwards. When we went to fail it backwards, we found that the downstream channel was in AwaitingRAA (ie it hadn't sent us an RAA after we sent it something), so we added the failure to the holding cell. Then, we actually got a really surprising reorg and suddenly had a claim on the original, upstream, on-chain output, resulting in a new fulfill event! When that event came through we found that the downstream channel still hadn't provided the needed RAA so we got here.
Phew...that was a lot of rare cases, so probably not worth special (at least untested) code to handle :P.
@@ -1407,6 +1407,10 @@ impl ChannelMonitor {
	output: spend_tx.output[0].clone(),
});
txn_to_broadcast.push(spend_tx);

// TODO: We need to fail back HTLCs that weren't included in the broadcast
So if I get it right, there are two reasons an HTLC (for which we have a coupled inbound HTLC on the backward channel) couldn't be included in the remote commitment tx: either it didn't meet dust, or the commitment tx is an old one, not representative of the latest state. What if the remote peer broadcasts a stale tx before we provide_latest_{remote,local}_tx to ChannelMonitor but after we committed to the new inbound HTLC on the backward channel?
If we create a transaction for the remote side, we must provide_latest_remote_tx (and update the monitor, and write it all to disk) before we send the signatures required for the remote to broadcast it. For local, we either have to succeed at committing new monitors or we have to fail the channel immediately using the current in-memory monitor (though I think we need to move to failing HTLCs based only on remotes, so this won't matter anyway).
To clarify my question: what if the remote peer broadcasts its commitment tx while we're done with the inbound HTLCs (so we have them in HTLCForwardInfo after Channel::forward_htlcs) but before we call process_pending_htlc_forwards? I didn't test it, but it seems we're unable to fail backward the inbound HTLCs...
Then the counterparty would be failing a transaction with inbound HTLCs, not outbound ones, which we aren't really handling in this patch. If they dutifully committed to an HTLC (implying the corresponding ChannelMonitor was updated) and it ended up in our pending HTLC forwards, there is nothing wrong with forwarding it; it just means that when the upstream node either fails or forwards, we need to get that information back to the original ChannelMonitor so it can do the same on chain. Not forwarding it may be nicer since we could avoid the on-chain fee and just let them claim the timeout, but that is a relatively small edge case and adding code to handle it probably isn't worth it.
Okay, good. Well, letting the other side pay the fee would be a nice extension of what a greedy algorithm means.
// transaction (or in any of the latest two local commitment transactions). This probably
// needs to use the same logic as the revoked-tx-announce logic - checking the last two
// remote commitment transactions. This probably has implications for what data we need to
// store in local commitment transactions.
Wait, what additional data do we need to store in the local commitment tx if we check against the latest two remote commitment txs there?
I think we may be able to store less data.
So the idea is to fail backward HTLCs at would_broadcast_height, so we don't have to extract timeout updates from txs spending our own commitment tx? That would effectively be smarter; seems we'd also need to put stuff in ChannelManager::force_close_channel.
Nono, I think we need to switch from examining the latest local commitment tx here to examining the latest two remote commitment transactions like we do in check_spend_remote_transaction in case of revoked tx broadcast. This implies we may be able to remove some of the HTLC info from local txn since we won't be using that here.
Hmm, let's say we accept doing that: we would generate htlc_updated in would_broadcast_height based on what is inside the latest two remote commitment txs? If so, we would pass backward a timeout HTLC from a not-yet-broadcast local commitment tx; does that raise any problem? Doing this would effectively mean less stuff in the latest local commitment tx and would simplify is_resolving_htlc_output.
Hmm, to be honest I haven't fully thought through this one, hence why I just slapped on the TODO and moved forward (this thing really needs to get merged...). It may turn out we can't do something so simple, but intuitively it seems like we may be ignoring some HTLCs here where we shouldn't be.
// TODO: We need to consider HTLCs which were below dust threshold here - while they don't
// strictly imply that we need to fail the channel, we need to go ahead and fail them back
// to the source, and if we don't fail the channel we will have to ensure that the next
// updates that peer sends us are update_fails, failing the channel if not. It's probably
Hmmm, but it's not the role of ChannelMonitor to check update_fail, so effectively it seems easier to fail the channel when the height comes.
Indeed, there would have to be a bit of plumbing of stuff from monitors to Channels themselves, so I agree, it's probably easier to not worry about this edge case and just always fail to chain.
Agreed, we don't have to broadcast the commitment tx, because we can't time out these trimmed HTLCs on-chain (they are not present in the commitment tx), but protocol logic asks us to fail backward. Ideally we should wait for the remote's update_fail, but we go ahead to avoid a layer violation. Maybe remove the TODO and just leave a comment?
The TODO is that we need to do something - I think the right something is to just close the channel wholesale if a below-dust HTLC times out, but we need to switch to doing that from the current code which still only looks at the HTLCs present in the commitment tx.
Spoke too fast, yes
This is somewhat awkward, but prevents a slew of duplicate events. Really this should probably be more explicit, but it would be easy to move that along with a slew of block_connected-event-processing refactors, see also GH lightningdevkit#80. This affects full_stack_target only by accident, because the demo test didn't continue onwards with another block connection.
Ease readability. Conditional compilation for the weight of the second one, to handle tests' special cltv values.
Including detection of timeout claims, fulfill claims, and failing all current HTLCs in case of revoked-commitment broadcast.
Fix variable name to payment_hash instead of txid for the index of remote_hash_commitment_number in the ChannelMonitor reader.
In case of broadcast of a revoked local commitment tx, we may be interested to know that we've screwed up.
Force-pushed from 461501e to d56b479 (compare)
This is #198 rebased and with a number of substantial and many non-substantial changes. The full diff against a simple rebase on master is available at the 0bin link, but some changes include:
https://0bin.net/paste/icbkyeuL3WD7nhVA#tLgGthEsvsid0wHky0SKVW3-0+TpTmru409h/X6A/sr